# Expert Parallel (EP)

Expert Parallel distributes Mixture-of-Experts (MoE) model experts across multiple GPUs, allowing each rank to hold a subset of experts. This reduces per-GPU memory and enables training of large MoE models.

## Overview

| Concept | Description |
|---------|-------------|
| **ExpertParallelConfig** | Configuration dataclass controlling EP behavior |
| **apply_expert_parallel()** | Entry point that shards experts and patches forward |
| **shard_experts()** | Evenly splits experts across EP ranks |
| **patch_forward()** | Replaces MoE block forward with EP-aware all-to-all communication |

## Configuration

```python
from twinkle.model.transformers.moe.expert_parallel import ExpertParallelConfig

config = ExpertParallelConfig(
    enabled=True,                 # Enable expert parallel
    router_dtype='fp32',          # Router computation dtype: 'fp32', 'bf16', 'fp16'
    keep_router_logits=True,      # Return router logits alongside hidden states
    ignore_shared_experts=False,  # Skip shared expert computation (e.g. DeepSeek)
    ep_size=None,                 # EP world size (consumed by TransformersModel)
)
```

## Usage with DeviceMesh

EP is activated by setting `ep_size` in `DeviceMesh.from_sizes()`. The framework automatically calls `apply_expert_parallel()` during model initialization.

```python
from twinkle.utils import DeviceMesh

# 8 GPUs: 2-way EP × 4-way data parallel
device_mesh = DeviceMesh.from_sizes(
    world_size=8,
    dp_size=4,
    ep_size=2,
)
```

For combined EP + FSDP sharding on the expert parameters:

```python
# 8 GPUs: 2-way EP with FSDP within each EP group
device_mesh = DeviceMesh.from_sizes(
    world_size=8,
    dp_size=2,
    ep_size=2,
    ep_fsdp_size=2,
)
```

## Communication Pattern

The EP forward pass follows a 4-stage pipeline:

1. **Preprocess**: compute per-expert token counts and split sizes
2. **Token Pre-All2All**: permute tokens by expert assignment, then all-to-all exchange across EP ranks
3. **Expert Compute**: each rank runs its local experts on received tokens
4. **Token Post-All2All**: all-to-all exchange results back, unpermute and apply routing weights

```
Input tokens → Router → [preprocess] → [pre_all2all] → [local experts] → [post_all2all] → Output
```

## Requirements

- `num_experts` must be divisible by `ep_size`
- `torch.distributed` must be initialized
- MoE blocks must define a `gate`/`router` module and `experts` (either `nn.ModuleList` or tensor-style `gate_up_proj`/`down_proj`)
- Both ModuleList-style and tensor-style (fused) experts are supported
- Shared experts (e.g. DeepSeek MoE) are handled automatically unless `ignore_shared_experts=True`
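
The divisibility requirement on `num_experts` follows from the even split across EP ranks. The sketch below illustrates one way such a split could work. It assumes a contiguous block assignment (rank 0 gets the first block, rank 1 the next, and so on), which may not match the layout `shard_experts()` actually uses. `local_expert_ids` is a hypothetical helper written for illustration and is not part of twinkle.

```python
def local_expert_ids(num_experts: int, ep_size: int, ep_rank: int) -> list:
    """Return the global expert indices owned by `ep_rank`.

    Assumption: experts are assigned in contiguous, equally sized blocks.
    This is an illustrative helper, not a twinkle API.
    """
    if ep_size <= 0:
        raise ValueError("ep_size must be positive")
    if not 0 <= ep_rank < ep_size:
        raise ValueError(f"ep_rank {ep_rank} is outside [0, {ep_size})")
    if num_experts % ep_size != 0:
        # An even split is impossible, so some rank would hold more experts.
        raise ValueError(
            f"num_experts ({num_experts}) must be divisible by ep_size ({ep_size})"
        )
    experts_per_rank = num_experts // ep_size
    start = ep_rank * experts_per_rank
    return list(range(start, start + experts_per_rank))


# 8 experts over 2 EP ranks: each rank holds 4 experts.
for rank in range(2):
    print(rank, local_expert_ids(num_experts=8, ep_size=2, ep_rank=rank))
```

With 8 experts and `ep_size=2`, each rank holds 4 experts, so per-GPU expert memory is roughly halved compared with holding all 8.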
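
The bookkeeping behind the four-stage communication pattern can be traced without GPUs or `torch.distributed`. The following is a minimal single-process sketch in pure Python: lists stand in for tensors, and a transpose of nested lists stands in for the all-to-all collective. It assumes top-1 routing and contiguous expert-to-rank assignment, and it omits routing weights. All function names here are hypothetical and do not correspond to twinkle internals.

```python
from collections import Counter


def preprocess(expert_ids, num_experts, ep_size):
    """Stage 1: per-expert token counts and per-destination-rank split sizes."""
    experts_per_rank = num_experts // ep_size
    counts = Counter(expert_ids)
    tokens_per_expert = [counts.get(e, 0) for e in range(num_experts)]
    send_splits = [
        sum(tokens_per_expert[r * experts_per_rank:(r + 1) * experts_per_rank])
        for r in range(ep_size)
    ]
    return tokens_per_expert, send_splits


def permute(tokens, expert_ids):
    """Stage 2a: stable-sort tokens by expert id; keep the order for later."""
    order = sorted(range(len(tokens)), key=lambda i: expert_ids[i])
    return [tokens[i] for i in order], order


def split_by_rank(permuted, send_splits):
    """Stage 2b: cut the permuted tokens into one chunk per destination rank."""
    chunks, start = [], 0
    for size in send_splits:
        chunks.append(permuted[start:start + size])
        start += size
    return chunks


def all_to_all(send_buffers):
    """Simulated collective: recv[dst][src] = send[src][dst]."""
    n = len(send_buffers)
    return [[send_buffers[src][dst] for src in range(n)] for dst in range(n)]


def unpermute(values, order):
    """Stage 4b: restore the original token order."""
    out = [None] * len(order)
    for pos, original_index in enumerate(order):
        out[original_index] = values[pos]
    return out


# One rank's view: 6 tokens, 4 experts, 2 EP ranks (experts 0-1 and 2-3).
tokens = ["a", "b", "c", "d", "e", "f"]
expert_ids = [3, 0, 2, 0, 1, 3]

tokens_per_expert, send_splits = preprocess(expert_ids, num_experts=4, ep_size=2)
permuted, order = permute(tokens, expert_ids)
chunks = split_by_rank(permuted, send_splits)

print(tokens_per_expert)  # [2, 1, 1, 2]
print(send_splits)        # [3, 3]
print(chunks)             # [['b', 'd', 'e'], ['c', 'a', 'f']]
print(unpermute(permuted, order) == tokens)  # True
```

In this simulation the post-all2all step is the same transpose applied a second time, which returns every chunk to the rank that sent it. A real implementation would also need to exchange the split sizes so each rank knows how many tokens to expect from its peers.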